THE ASF CONFERENCE


When Hardware Fails, Ozone Prevails


Ethan Rose
Senior Software Engineer, Cloudera Inc
Apache Ozone Committer, PMC

Failure Domains


* Network failure
* Node failure
* Disk failure
* Bit rot/corruption


CLOUDERA CoC 2023 NA Oct. 7 - Oct. 10


Ozone Architecture


[Diagram labels: Ozone FileSystem API, Metadata Operations, Data Transfer, Block Storage Layer, Monitoring]
Ozone Manager (OM)


* Store Metadata Only
* Volumes, buckets, keys, ACLs
* Data replicated with Apache Ratis (Raft Implementation)


[Diagram: Ozone Manager, Leader]
OM: Disk Failure/DB Corruption


* OM stores data on 2 disks:
1. Ratis Logs
2. RocksDB (Ratis state machine)
* Failure to write to disk or DB is fatal
* All previous transactions must be applied before the current one.
* RAID 1 recommended


[Diagram: Ozone Manager with two Followers]
OM Faults: Relevant Config Keys


* ozone.om.ratis.storage.dir: Disk for Ratis logs
* raft.server.leaderelection.leader.step-down.wait-time: Time after the last heartbeat at which the leader will step down.
* raft.server.rpc.timeout.min: Lower bound of the random timeout for a follower to initiate a leader election.
* raft.server.rpc.timeout.max: Upper bound of the random timeout for a follower to initiate a leader election.
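

The two timeout bounds above describe a randomized election timeout. A minimal sketch of that idea follows, assuming a follower draws one timeout uniformly between the bounds and starts an election once that much time has passed without a leader heartbeat. The function names and the millisecond values in the usage note are hypothetical and are not taken from Ratis or Ozone.

```python
import random


def pick_election_timeout(timeout_min_ms: int, timeout_max_ms: int,
                          rng: random.Random) -> float:
    """Draw a random election timeout between the configured bounds.

    Randomizing the timeout makes it unlikely that all followers
    start an election at the same instant.
    """
    if timeout_min_ms > timeout_max_ms:
        raise ValueError("min timeout must not exceed max timeout")
    return rng.uniform(timeout_min_ms, timeout_max_ms)


def should_start_election(ms_since_last_heartbeat: float,
                          election_timeout_ms: float) -> bool:
    """A follower starts an election once its timeout has elapsed."""
    return ms_since_last_heartbeat >= election_timeout_ms
```

For example, with illustrative bounds of 1000 ms and 1200 ms, a follower that has heard nothing for 1500 ms would start an election, while one that received a heartbeat 10 ms ago would not.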
Storage Container Manager (SCM)


* Tracks metadata of storage containers (unit of replication)
* Write pipelines, replica information, cluster balancing, replication, etc.
* Data replicated with Apache Ratis, similar to OM
* SCM failures are handled in the same way as OM


[Diagram: Storage Container Manager Leader with two Followers]
SCM Faults: Relevant Config Keys


* ozone.scm.ha.ratis.storage.dir: Disk for Ratis logs
* raft.server.leaderelection.leader.step-down.wait-time: Time after the last heartbeat at which the leader will step down.
* raft.server.rpc.timeout.min: Lower bound of the random timeout for a follower to initiate a leader election.
* raft.server.rpc.timeout.max: Upper bound of the random timeout for a follower to initiate a leader election.
SCM + Datanodes


* Datanodes periodically heartbeat to SCM
* Leader SCM makes decisions based on datanode health.
* Leader SCM responds to heartbeats with commands for datanodes
* Replicate data, delete blocks, etc.

SCM + Datanodes + Clients: Network Failure 
* Data replication is not yet triggered
SCM + Datanodes: Node Failure


* When a stale datanode misses further heartbeats, SCM marks it as dead.
* Trigger replication/reconstruction from other nodes.
SCM to Datanode Communication


* hdds.heartbeat.interval: Interval between datanode to SCM heartbeats
* ozone.scm.stale.node.interval: Timeout for SCM to mark a datanode stale if no heartbeat is received
* ozone.scm.dead.node.interval: Timeout for SCM to mark a datanode dead if no heartbeat is received
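

The stale and dead timeouts above suggest a simple classification by time since the last heartbeat. The sketch below illustrates it under the assumption that both timeouts are measured from the last received heartbeat. `NodeState` and `classify_node` are hypothetical names, and the intervals in the usage note are illustrative rather than Ozone defaults.

```python
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    STALE = "stale"   # missed heartbeats; replication not yet triggered
    DEAD = "dead"     # replication/reconstruction is triggered


def classify_node(seconds_since_heartbeat: float,
                  stale_interval_s: float,
                  dead_interval_s: float) -> NodeState:
    """Classify a datanode by how long ago its last heartbeat arrived."""
    if seconds_since_heartbeat >= dead_interval_s:
        return NodeState.DEAD
    if seconds_since_heartbeat >= stale_interval_s:
        return NodeState.STALE
    return NodeState.HEALTHY
```

With an illustrative stale interval of 300 seconds and a dead interval of 600 seconds, a node silent for 400 seconds is stale, and it only becomes dead once the silence reaches 600 seconds.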
* RocksDB per volume holds metadata
Volume Scanner: Directory Checks


* Goal: Identify configuration issues
* Disk mount point exists
* Disk mount point has correct permissions
Volume Scanner: Disk Checks


1. Save random byte string in memory (110110101...)
2. Write to tmp file
3. Sync file
4. Read back from file


* hdds.datanode.disk.check.io.file.size: How large of a file to use in the disk IO check
* hdds.datanode.disk.check.timeout: Maximum time a disk scan can take before it is considered failed
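

The write, sync, and read-back steps above can be sketched roughly as follows. This is an illustration, not the datanode's implementation: `disk_io_check` is a hypothetical name, the default file size is arbitrary, and the final comparison against the in-memory bytes is an assumption about how the read-back result is used.

```python
import os
import tempfile


def disk_io_check(volume_dir: str, file_size_bytes: int = 4096) -> bool:
    """Return True if a small file can be written, synced, and read back intact."""
    expected = os.urandom(file_size_bytes)      # 1. random bytes kept in memory
    path = None
    try:
        fd, path = tempfile.mkstemp(dir=volume_dir, prefix="disk-check-")
        with os.fdopen(fd, "wb") as f:
            f.write(expected)                   # 2. write to a tmp file
            f.flush()
            os.fsync(f.fileno())                # 3. sync the file to the device
        with open(path, "rb") as f:
            actual = f.read()                   # 4. read the file back
        return actual == expected               # assumed final step: compare
    except OSError:
        return False                            # any IO error counts as a failed check
    finally:
        if path is not None:
            try:
                os.remove(path)                 # do not leave test files behind
            except OSError:
                pass
```

A caller would run this against each configured volume directory and feed the boolean result into whatever failure policy it uses.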
Volume Scanner: Iterating Disk Checks


* Don't mistake intermittent IO errors for full disk failure
* Disk check sliding window: 2 of the last 3 disk checks must pass
* Re-check same disk over time


* hdds.datanode.disk.check.io.test.count: Size of disk check sliding window
* hdds.datanode.disk.check.io.[...]: Number of checks in the window that can fail without volume being failed
* hdds.datanode.periodic.disk.check.interval.minutes: How frequently the background disk checker runs
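

The sliding window rule above (2 of the last 3 checks must pass) can be illustrated with a small sketch. `SlidingWindowHealth` is a hypothetical name. The defaults of 3 and 1 mirror the numbers on the slide and are not read from any configuration.

```python
from collections import deque


class SlidingWindowHealth:
    """Track recent disk check results and tolerate a few failures."""

    def __init__(self, window_size: int = 3, failures_tolerated: int = 1):
        # deque with maxlen drops the oldest result automatically
        self.results = deque(maxlen=window_size)
        self.failures_tolerated = failures_tolerated

    def record(self, passed: bool) -> bool:
        """Record one check; return True while the volume counts as healthy."""
        self.results.append(passed)
        failures = sum(1 for ok in self.results if not ok)
        return failures <= self.failures_tolerated
```

A single intermittent IO error leaves the volume healthy, while a second failure inside the same window of three marks it failed.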
Container Scanner: Metadata Checks


* Fast check, can be done more frequently
* Is the container's directory and metadata still present and readable?


* hdds.container.scrub.metadata.scan.interval: Frequency the background container metadata scanner runs
Container Scanner: Data Checks


* Slow check
* Eventually touches every byte on disk
* May run continuously in the background
* Superset of metadata checks
* Do block checksums match?
* Does RocksDB metadata match blocks?


[Diagram labels: Storage Containers (5 GB each), RocksDB]


* hdds.container.scrub.data.scan.interval: Frequency the background container data scanner runs
* hdds.container.scrub.volume.bytes.per.second: Bandwidth throttle for background container data scanner
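

The two data check questions above (do block checksums match, and does the metadata match the blocks) can be sketched as follows. This is a simplified model: blocks and stored checksums are plain dictionaries, CRC32 from `zlib` stands in for whichever checksum type a cluster uses, and `scan_container` and `min_scan_seconds` are hypothetical names.

```python
import zlib
from typing import Dict, List


def scan_container(blocks: Dict[str, bytes],
                   stored_checksums: Dict[str, int]) -> List[str]:
    """Return the names of blocks whose data and metadata disagree."""
    unhealthy = []
    for name, data in blocks.items():
        if name not in stored_checksums:
            unhealthy.append(name)              # block on disk, no metadata
        elif zlib.crc32(data) != stored_checksums[name]:
            unhealthy.append(name)              # checksum mismatch (bit rot)
    # metadata entries that point at blocks no longer on disk
    unhealthy.extend(n for n in stored_checksums if n not in blocks)
    return sorted(unhealthy)


def min_scan_seconds(total_bytes: int, bytes_per_second: int) -> float:
    """Lower bound on scan time implied by a bandwidth throttle."""
    return total_bytes / bytes_per_second
```

The throttle helper shows why the data scan is slow: at an illustrative limit of 5 MiB per second, reading one 5 GiB container takes at least 1024 seconds.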
Container + Volume Scanner: Scheduling Scans


* Background Scans: Ensure data is intact even when not accessed. 
* On-Demand Scans: Triggered when failure is encountered. 


* Background scans may take a while to get to this data otherwise. 
Container + Volume Scanner: On-Demand Scans


[Diagram labels: Client retries on a different datanode's replica; On-demand disk scan queue; On-demand container scan queue; Containers (5 GB each); RocksDB]


* hdds.container.scrub.data.scan.interval: Frequency the background container data scanner runs
* hdds.container.scrub.on.demand.volume.bytes.per.second: Bandwidth throttle for on-demand container data scanner
* hdds.container.scrub.min.gap: Minimum gap between consecutive scans of the same container
* hdds.datanode.disk.check.min.gap: Minimum gap between consecutive scans of the same volume
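

The minimum gap settings above imply that repeated failures on the same container should not trigger back-to-back scans. One way to model that is a queue that rejects a request when a scan is already pending or ran too recently. `OnDemandScanQueue` and its methods are hypothetical names, and time is passed in explicitly to keep the sketch deterministic.

```python
from collections import deque
from typing import Dict, Optional


class OnDemandScanQueue:
    """Queue on-demand scans while enforcing a minimum gap per container."""

    def __init__(self, min_gap_s: float):
        self.min_gap_s = min_gap_s
        self.queue = deque()
        self.last_scan: Dict[int, float] = {}   # container id -> last scan time

    def request_scan(self, container_id: int, now_s: float) -> bool:
        """Queue a scan unless one is pending or ran within min_gap_s."""
        if container_id in self.queue:
            return False
        last = self.last_scan.get(container_id)
        if last is not None and now_s - last < self.min_gap_s:
            return False
        self.queue.append(container_id)
        return True

    def next_scan(self, now_s: float) -> Optional[int]:
        """Pop the next container to scan and record when it was scanned."""
        if not self.queue:
            return None
        container_id = self.queue.popleft()
        self.last_scan[container_id] = now_s
        return container_id
```

The same shape would apply to volume scans, with a volume identifier in place of the container id.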
Recovering From Container Corruption


1. Report Unhealthy Container 2
2. Replicate Container 2
3. Report replicated Container 2
4. Delete Unhealthy Container 2
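

The recovery sequence for a corrupted container has one important property: the unhealthy replica is deleted only after a new copy has been made from a healthy one. The sketch below plans commands in that order. It is a toy model in which `plan_recovery`, the state strings, and the command strings are all hypothetical. It also assumes the new copy goes to a spare node, which the slides do not state.

```python
from typing import Dict, List


def plan_recovery(replica_states: Dict[str, str],
                  spare_nodes: List[str]) -> List[str]:
    """Plan replicate-then-delete commands for unhealthy replicas.

    replica_states maps a datanode name to "HEALTHY" or "UNHEALTHY"
    for a single container.
    """
    healthy = [n for n, s in replica_states.items() if s == "HEALTHY"]
    unhealthy = [n for n, s in replica_states.items() if s == "UNHEALTHY"]
    commands: List[str] = []
    if not healthy:
        return commands                          # no intact source to copy from
    for bad, spare in zip(unhealthy, spare_nodes):
        commands.append(f"replicate {healthy[0]} -> {spare}")
        commands.append(f"delete {bad}")         # only after the new copy exists
    return commands
```

If no healthy replica exists, the plan is empty, which reflects that deleting the only remaining copies would lose data.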


DN Faults: Relevant Config Keys 


* [...]interval: [...] manager runs
* hdds.scm.replication.datanode.replication.limit: Limits the number of replication commands queued per datanode
* hdds.scm.replication.inflight.limit.factor: Limits the total amount of replication happening in the cluster
Client + Datanode: Write Corruption


1. Client calculates checksum for data
2. Client sends data + checksum to datanode
3. Datanode verifies checksum matches data
4. Datanode stores checksum and data
5. Datanode acks to client
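

The five write steps above can be sketched with an in-memory stand-in for a datanode. This is an illustration only: `Datanode`, `write_chunk`, `client_write`, and `ChecksumMismatch` are hypothetical names, and CRC32 from `zlib` stands in for whichever checksum type is configured.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised when received data does not match its checksum."""


class Datanode:
    def __init__(self):
        self.store = {}                          # chunk id -> (data, checksum)

    def write_chunk(self, chunk_id: str, data: bytes, checksum: int) -> str:
        if zlib.crc32(data) != checksum:         # 3. verify before storing
            raise ChecksumMismatch(chunk_id)
        self.store[chunk_id] = (data, checksum)  # 4. store checksum and data
        return "ack"                             # 5. ack to the client


def client_write(datanode: Datanode, chunk_id: str, data: bytes) -> str:
    checksum = zlib.crc32(data)                  # 1. client computes checksum
    return datanode.write_chunk(chunk_id, data, checksum)  # 2. send both
```

Because the datanode verifies before storing, data corrupted in transit is rejected instead of being persisted alongside a checksum that no longer describes it.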
Client + Datanode: Read Corruption


1. Client requests data
2. Datanode retrieves existing checksums and data
3. Datanode sends checksums and data to client
4. Client verifies checksum matches data
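

On the read path the client does the verification, and an earlier slide notes that the client retries on a different datanode's replica after a failure. A minimal sketch of that combination follows. Replicas are modeled as dictionaries, `read_verified` is a hypothetical name, and CRC32 again stands in for the real checksum type.

```python
import zlib
from typing import Dict, List, Tuple

Replica = Dict[str, Tuple[bytes, int]]           # chunk id -> (data, checksum)


def read_verified(replicas: List[Replica], chunk_id: str) -> Tuple[bytes, int]:
    """Return (data, replica index) from the first replica that verifies."""
    for index, replica in enumerate(replicas):
        entry = replica.get(chunk_id)
        if entry is None:
            continue                             # this replica lacks the chunk
        data, checksum = entry
        if zlib.crc32(data) == checksum:         # client-side verification
            return data, index
        # mismatch: fall through and retry on the next replica
    raise OSError(f"no intact replica found for {chunk_id}")
```

In this model, a mismatch on one replica is also the point at which a real system could queue an on-demand scan of the affected container.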
Future Improvements


* Write checksum performance improvements (WIP) HDDS-/228
* Identifying and avoiding slow datanodes (soft failures)
* Store checksum inside block files (for zero copy support)
* More robust disk failure detection
* Periodically verify RocksDB read checksums for metadata
* Use metrics from deployments to better tune default configs
Future of Data Storage User Group Meetup


* October 25, 2023
* 4-7 PM PST
* Santa Clara, California + Online


[Logos: Uber, Cloudera, Iceberg]